Main Chemometrics Applications
- Design of experiments (before data acquisition, mind works better than software here)
- Chemical data preprocessing
- Multivariate exploratory analysis
- Classification
- Modelling
- Multivariate Curve Resolution
“Chemometrics is the chemical discipline that uses mathematical, statistical, and other methods employing formal logic to design or select optimal measurement procedures and experiments, and to provide maximum relevant chemical information by analyzing chemical data.”
Vékey, K., Telekes, A., & Vertes, A. (Eds.). (2011). Medical applications of mass spectrometry. Elsevier.
- chemometrics, https://cran.r-project.org/web/packages/chemometrics/index.html
- ChemometricsWithR, https://github.com/rwehrens/ChemometricsWithR/
- ChemoSpec, https://cloud.r-project.org/web/packages/ChemoSpec/index.html
- mdatools, https://cran.r-project.org/web/packages/mdatools/index.html

Updated information can be found in the CRAN Task View: Chemometrics and Computational Physics (https://cran.r-project.org/web/views/ChemPhys.html).
There are many other packages that will be useful for specific tasks. These will appear throughout the session.
pckgs<-c("tidyverse","ggthemes","GGally",
"ggalt","gtools","gridExtra")
pckgs2Install<-pckgs[!(pckgs %in% library()$results[,1])]
pckgs2Load<-pckgs[!(pckgs %in% (.packages()))]
for(pckg in pckgs2Install) {
install.packages(pckg,
repos="https://cloud.r-project.org/",
quiet=TRUE, type="binary")
}
for(pckg in pckgs2Load) {
library(pckg, character.only = TRUE)
}

Many times, raw data needs to be processed before it can be analyzed and compared with preexisting data.
Common processing includes
In the case of chemical data, this is especially true for spectral data.
There are different types of spectral data
Let’s explore this process with several data sets:
ptw::gaschrom, pls::gasoline, the msdata package, and the baseline package.

## [1] 931
## [1] 578 2
## 'data.frame': 45 obs. of 2 variables:
## $ cow : num 0 0.25 0.375 0.875 0.5 0.75 0.5 0.125 0 0.125 ...
## $ spectra: num [1:45, 1:21451] 1029 371 606 368 554 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:21451] "4999.94078628963" "5001.55954267662" "5003.17856106153" "5004.79784144435" ...
## - attr(*, "terms")=Classes 'terms', 'formula' language cow ~ spectra
## .. ..- attr(*, "variables")= language list(cow, spectra)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "cow" "spectra"
## .. .. .. ..$ : chr "spectra"
## .. ..- attr(*, "term.labels")= chr "spectra"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(cow, spectra)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "nmatrix.21451"
## .. .. ..- attr(*, "names")= chr [1:2] "cow" "spectra"
Main methods for noise removal are:
Running medians, thresholding, aggregating repetitions, and consensus approaches are also used.
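As a base-R illustration of one of these approaches, a running median can be applied with stats::runmed (a sketch on simulated data, not the milk spectra used below):

```r
# Sketch: running-median smoothing of a simulated noisy signal
set.seed(1)
x <- sin(seq(0, 2 * pi, length.out = 200)) + rnorm(200, sd = 0.3)
x_med <- runmed(x, k = 7)  # running median over a 7-point window
```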
milkMS_sg7 <- milkMS
# 7-point quadratic Savitzky-Golay smoothing kernel
milkMS_sg7$I <- as.numeric(stats::filter(milkMS$I,
                                         c(-2,3,6,7,6,3,-2)/21))
g1 <- ggplot(milkMS, aes(x=mz, y=I)) +
geom_line() +
coord_cartesian(xlim=c(10000,11000),ylim=c(0,500)) +
theme_classic()
g2 <- ggplot(milkMS_sg7, aes(x=mz, y=I)) +
geom_line() +
coord_cartesian(xlim=c(10000,11000),ylim=c(0,500)) +
theme_classic()
grid.arrange(g1,g2)

I_fft <- fft(milkMS$I)
I_fft[513:(length(I_fft)-512)] <- 0  # keep only the 1024 lowest-frequency coefficients
milkMS_fft <- data.frame(mz = milkMS$mz,
I=Re(fft(I_fft, inverse=TRUE)/length(I_fft)))
g1 <- ggplot(milkMS, aes(x=mz, y=I)) +
geom_line() +
theme_classic()
g2 <- ggplot(milkMS_fft, aes(x=mz, y=I)) +
geom_line() +
theme_classic()
grid.arrange(g1,g2)

Baseline adjustment is commonly done by finding local minima and fitting a curve to them. This curve is then subtracted from the spectrum.
Common methods for baseline adjustment are:
MSC is available in the pls package. ALS and many other methods, such as LOWESS or a polynomial fit, are implemented in the baseline package.
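The local-minima approach described above can be sketched in base R with a low-order polynomial fit (the simulated spectrum and all names are illustrative):

```r
# Simulated spectrum: slowly varying baseline + one peak + positive noise
set.seed(2)
x <- seq_len(500)
y <- 2 + 0.01 * x + 2e-05 * x^2 +          # curved baseline
     200 * dnorm(x, mean = 250, sd = 8) +  # a single peak
     abs(rnorm(500, sd = 0.05))            # noise
df <- data.frame(x, y)
# Local minima: points lower than both neighbours
isMin <- c(FALSE, diff(sign(diff(y))) > 0, FALSE)
# Fit a low-order polynomial through the minima and subtract it
fit <- lm(y ~ poly(x, 2), data = df[isMin, ])
yCorrected <- y - predict(fit, newdata = df)
```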
originalspectra <- as.matrix(woodNIR$NIR)
newspectra<-msc(originalspectra)
mydataOrig <- data.frame(id=woodNIR[,5], NIR = originalspectra) %>%
pivot_longer(cols=-1, names_to = "wl", values_to = "Abs") %>%
mutate(wl=as.numeric(substr(wl,6,nchar(wl))))
mydataMSC <- data.frame(id=woodNIR[,5], NIR = newspectra) %>%
pivot_longer(cols=-1, names_to = "wl", values_to = "Abs") %>%
mutate(wl=as.numeric(substr(wl,6,nchar(wl))))
g1 <- ggplot(mydataOrig, aes(x=wl, y=Abs, color=as.factor(id))) +
geom_line() +
theme_classic() + theme(legend.position = "none")
g2 <- ggplot(mydataMSC, aes(x=wl, y=Abs, color=as.factor(id))) +
geom_line() +
theme_classic() + theme(legend.position = "none")
grid.arrange(g1,g2)

Different phenomena can provoke misalignments among spectra, and peaks must be aligned before spectra can be compared. In many cases this is handled experimentally with standards or solved through binning, but in some cases, such as LC chromatograms, data processing is required.
Shifts can be corrected using time warping functions, for example the ptw or dtw functions in the packages of the same name.
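For the simplest case of a constant shift, the idea can be illustrated with plain cross-correlation in base R (a sketch on simulated peaks; real warping also handles local stretches and compressions):

```r
# Two identical peaks offset by 15 points
ref  <- dnorm(1:300, mean = 150, sd = 10)
samp <- dnorm(1:300, mean = 165, sd = 10)
# Cross-correlate and take the lag of maximum correlation
cc <- ccf(samp, ref, lag.max = 50, plot = FALSE)
shift <- cc$lag[which.max(cc$acf)]  # magnitude of the offset is 15
```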
## [1] 0.9980289
## PTW object: individual alignment of 1 sample on 1 reference.
##
## Warping coefficients:
## [,1] [,2] [,3]
## [1,] -29.22414 1.029991 -1.041421e-05
##
## Warping criterion: WCC
## Warping mode: forward
## Value: 0.01627874
Especially in MS, binning is often required to allow spectra to be compared.
dfMS1 <- data.frame(mz = sciexSpec[[1]][,1],
I = sciexSpec[[1]][,2])
dfMS2 <- dfMS1 %>% mutate(mz=round(2*mz)/2) %>%
group_by(mz) %>% summarize(I=sum(I),.groups="drop")
dfMS3 <- dfMS1 %>% mutate(mz=round(mz/2)*2) %>%
group_by(mz) %>% summarize(I=sum(I),.groups="drop")
g1 <- ggplot(dfMS1, aes(x=mz, xend=mz,y=0, yend=I))+
geom_segment() + theme_classic()
g2 <- ggplot(dfMS2, aes(x=mz, xend=mz,y=0, yend=I))+
geom_segment() + theme_classic()
g3 <- ggplot(dfMS3, aes(x=mz, xend=mz,y=0, yend=I))+
geom_segment() + theme_classic()
grid.arrange(g1,g2,g3)

It is also common in spectra processing to convert profile data to a peak list.
This is usually done by looking for local maxima and/or applying an intensity threshold.
## tibble [41 x 2] (S3: tbl_df/tbl/data.frame)
## $ mz: num [1:41] 105 107 108 108 109 ...
## $ I : num [1:41] 412 1236 4121 412 2884 ...
## # A tibble: 3 x 2
## mz I
## <dbl> <dbl>
## 1 123 100588
## 2 124 646800
## 3 125 62351
span <- 100
dfMS1_g <- embed(dfMS1$I,span)
dfMS1_gmax <- span + 1 - apply(dfMS1_g,1,which.max)
dfMS1_gmax[dfMS1_gmax==1 | dfMS1_gmax==span] <- NA
dfMS1_peaksPos <- dfMS1_gmax + 0:(length(dfMS1_gmax)-1)
dfMS1_peaksPos <- unique(na.omit(dfMS1_peaksPos))
dfMS1$mz[dfMS1_peaksPos]

## [1] 114.0914 122.0956 123.0919 124.0860 125.0903 125.0966 126.4697 126.5886
## [9] 126.6474 126.7394 126.7760 126.9332 126.9571 126.9666 128.1406 129.9880
In many cases, scaling can change the outcomes of an analysis, so it is important to decide what scaling makes sense. Some examples are
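Two frequent choices, autoscaling and Pareto scaling, can be sketched in base R (the matrix and its values are purely illustrative):

```r
# Toy data matrix: columns are variables on very different scales
X <- matrix(c(1, 2, 3, 4,
              10, 20, 30, 40), ncol = 2)
Xauto   <- scale(X)                                 # centre + unit variance
Xpareto <- scale(X, scale = sqrt(apply(X, 2, sd)))  # centre + divide by sqrt(sd)
```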
Although analysis of leverage points, detection of outliers, and imputation of missing data are always relevant, we will not discuss them further here. Just keep in mind that this is always a necessary step in any data analysis.
All this processing is often done to compare chemical entities: spectra, molecules, mixtures… In any of these cases, decisions must be made about when two data instances match and/or how the distance between them is calculated.
Some examples of strategies for matching data are
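One common matching score for two spectra placed on a shared m/z grid is cosine similarity (a sketch; the intensity vectors are made up):

```r
# Cosine similarity between two intensity vectors on the same grid
cosineSim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
s1 <- c(0, 10, 50, 100, 20)
s2 <- c(0, 12, 48, 95, 25)
cosineSim(s1, s2)  # close to 1 for well-matching spectra
```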
Whether it comes from spectral data or not, it is common to have multivariate data in chemical analysis. In fact, chemometrics is commonly restricted to the study and analysis of multivariate data.
Common multivariate techniques in chemometrics are
We will discuss here some aspects of multivariate exploratory analysis.
We’ll use the classical Forina’s wines dataset, which is included in the kohonen package.
## 'data.frame': 177 obs. of 13 variables:
## $ alcohol : num 13.2 13.2 14.4 13.2 14.2 ...
## $ malic.acid : num 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 2.16 ...
## $ ash : num 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 2.3 ...
## $ ash.alkalinity : num 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 18 ...
## $ magnesium : num 100 101 113 118 112 96 121 97 98 105 ...
## $ tot..phenols : num 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 2.95 ...
## $ flavonoids : num 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 3.32 ...
## $ non.flav..phenols: num 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 0.22 ...
## $ proanth : num 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 2.38 ...
## $ col..int. : num 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 5.75 ...
## $ col..hue : num 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 1.25 ...
## $ OD.ratio : num 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 3.17 ...
## $ proline : num 1050 1185 1480 735 1450 ...
## alcohol malic.acid ash ash.alkalinity
## Min. :11.03 Min. :0.74 Min. :1.360 Min. :10.60
## 1st Qu.:12.36 1st Qu.:1.60 1st Qu.:2.210 1st Qu.:17.20
## Median :13.05 Median :1.87 Median :2.360 Median :19.50
## Mean :12.99 Mean :2.34 Mean :2.366 Mean :19.52
## 3rd Qu.:13.67 3rd Qu.:3.10 3rd Qu.:2.560 3rd Qu.:21.50
## Max. :14.83 Max. :5.80 Max. :3.230 Max. :30.00
## magnesium
## Min. : 70.00
## 1st Qu.: 88.00
## Median : 98.00
## Mean : 99.59
## 3rd Qu.:107.00
## Max. :162.00
## tot..phenols flavonoids non.flav..phenols proanth
## Min. :0.980 Min. :0.340 Min. :0.1300 Min. :0.410
## 1st Qu.:1.740 1st Qu.:1.200 1st Qu.:0.2700 1st Qu.:1.250
## Median :2.350 Median :2.130 Median :0.3400 Median :1.550
## Mean :2.292 Mean :2.023 Mean :0.3623 Mean :1.587
## 3rd Qu.:2.800 3rd Qu.:2.860 3rd Qu.:0.4400 3rd Qu.:1.950
## Max. :3.880 Max. :5.080 Max. :0.6600 Max. :3.580
## col..int.
## Min. : 1.280
## 1st Qu.: 3.210
## Median : 4.680
## Mean : 5.055
## 3rd Qu.: 6.200
## Max. :13.000
## col..hue OD.ratio proline
## Min. :0.480 Min. :1.270 Min. : 278.0
## 1st Qu.:0.780 1st Qu.:1.930 1st Qu.: 500.0
## Median :0.960 Median :2.780 Median : 672.0
## Mean :0.957 Mean :2.604 Mean : 745.1
## 3rd Qu.:1.120 3rd Qu.:3.170 3rd Qu.: 985.0
## Max. :1.710 Max. :4.000 Max. :1680.0
corWines <- data.frame(cor(dfWines)) %>%
rownames_to_column("var1") %>%
pivot_longer(2:14, names_to="var2", values_to = "cor") %>%
filter(var1<var2) %>%
arrange(desc(abs(cor)))
str(corWines)

## tibble [78 x 3] (S3: tbl_df/tbl/data.frame)
## $ var1: chr [1:78] "flavonoids" "flavonoids" "OD.ratio" "flavonoids" ...
## $ var2: chr [1:78] "tot..phenols" "OD.ratio" "tot..phenols" "proanth" ...
## $ cor : num [1:78] 0.864 0.786 0.7 0.65 0.641 ...
## # A tibble: 6 x 3
## var1 var2 cor
## <chr> <chr> <dbl>
## 1 flavonoids tot..phenols 0.864
## 2 flavonoids OD.ratio 0.786
## 3 OD.ratio tot..phenols 0.700
## 4 flavonoids proanth 0.650
## 5 alcohol proline 0.641
## 6 proanth tot..phenols 0.611
## # A tibble: 6 x 3
## var1 var2 cor
## <chr> <chr> <dbl>
## 1 magnesium malic.acid -0.0490
## 2 magnesium OD.ratio 0.0470
## 3 col..int. proanth -0.0271
## 4 ash.alkalinity col..int. 0.0205
## 5 ash proanth 0.00808
## 6 ash OD.ratio 0.00150
It consists of searching for a limited number of factors that conserve most of the data variability.
The most common way to perform this analysis is
##
## Kaiser-Meyer-Olkin Statistics
##
## Call: REdaS::KMOS(x = dfWines)
##
## Measures of Sampling Adequacy (MSA):
## alcohol malic.acid ash ash.alkalinity
## 0.7251534 0.7965202 0.4383797 0.6790194
## magnesium tot..phenols flavonoids non.flav..phenols
## 0.6667702 0.8722492 0.8138751 0.8208894
## proanth col..int. col..hue OD.ratio
## 0.8535643 0.6213451 0.7878646 0.8644313
## proline
## 0.8138733
##
## KMO-Criterion: 0.7767008
##
## Bartlett's Test of Sphericity
##
## Call: REdaS::bart_spher(x = cor(dfWines))
##
## X2 = 396.936
## df = 78
## p-value < 2.22e-16
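The PCA summarized below was obtained with princomp on the correlation matrix (see the `call` element in the output). A self-contained sketch of the same call, using the built-in USArrests data as a stand-in for dfWines:

```r
# PCA on the correlation matrix (cor = TRUE autoscales the variables)
pcaDemo <- princomp(USArrests, cor = TRUE)
summary(pcaDemo)  # standard deviations and variance explained
str(pcaDemo)      # sdev, loadings, center, scale, scores, call
```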
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 2.1628220 1.5815708 1.2055413 0.96148018 0.92829777
## Proportion of Variance 0.3598307 0.1924128 0.1117946 0.07111109 0.06628744
## Cumulative Proportion 0.3598307 0.5522435 0.6640381 0.73514919 0.80143663
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 0.80302411 0.74295478 0.59223207 0.53775461 0.49679842
## Proportion of Variance 0.04960367 0.04246014 0.02697991 0.02224462 0.01898528
## Cumulative Proportion 0.85104030 0.89350044 0.92048035 0.94272497 0.96171025
## Comp.11 Comp.12 Comp.13
## Standard deviation 0.47480542 0.41033745 0.322412350
## Proportion of Variance 0.01734155 0.01295206 0.007996133
## Cumulative Proportion 0.97905180 0.99200387 1.000000000
## List of 7
## $ sdev : Named num [1:13] 2.163 1.582 1.206 0.961 0.928 ...
## ..- attr(*, "names")= chr [1:13] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
## $ loadings: 'loadings' num [1:13, 1:13] 0.13789 -0.24638 -0.00432 -0.23738 0.135 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:13] "alcohol" "malic.acid" "ash" "ash.alkalinity" ...
## .. ..$ : chr [1:13] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
## $ center : Named num [1:13] 12.99 2.34 2.37 19.52 99.59 ...
## ..- attr(*, "names")= chr [1:13] "alcohol" "malic.acid" "ash" "ash.alkalinity" ...
## $ scale : Named num [1:13] 0.807 1.116 0.274 3.327 14.134 ...
## ..- attr(*, "names")= chr [1:13] "alcohol" "malic.acid" "ash" "ash.alkalinity" ...
## $ n.obs : int 177
## $ scores : num [1:177, 1:13] 2.23 2.53 3.75 1.02 3.05 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:13] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
## $ call : language princomp(x = dfWines, cor = TRUE)
## - attr(*, "class")= chr "princomp"
## $loadings
##
## Loadings:
## Comp.1 Comp.2 Comp.3
## alcohol 0.538
## malic.acid -0.224 0.137 -0.221
## ash 0.179 0.162 -0.656
## ash.alkalinity -0.198 -0.628
## magnesium 0.135 0.285 -0.165
## tot..phenols 0.409 0.122
## flavonoids 0.446
## non.flav..phenols -0.219 -0.254
## proanth 0.338
## col..int. -0.203 0.512
## col..hue 0.342 -0.213
## OD.ratio 0.430 -0.112
## proline 0.169 0.442 0.101
##
## Comp.1 Comp.2 Comp.3
## SS loadings 1.000 1.000 1.000
## Proportion Var 0.077 0.077 0.077
## Cumulative Var 0.077 0.154 0.231
##
## $rotmat
## [,1] [,2] [,3]
## [1,] 0.9244318 0.2184311 0.3125918
## [2,] -0.1295040 0.9508070 -0.2814155
## [3,] -0.3586844 0.2196676 0.9072440
## $loadings
##
## Loadings:
## Comp.1 Comp.2 Comp.3
## alcohol 0.546
## malic.acid -0.231 0.139 -0.188
## ash 0.170 -0.682
## ash.alkalinity -0.244 -0.629
## magnesium 0.132 0.262 -0.184
## tot..phenols 0.412
## flavonoids 0.450
## non.flav..phenols -0.225 -0.222
## proanth 0.340
## col..int. -0.210 0.525
## col..hue 0.349 -0.234
## OD.ratio 0.436 -0.146
## proline 0.170 0.436
##
## Comp.1 Comp.2 Comp.3
## SS loadings 1.023 1.013 1.023
## Proportion Var 0.079 0.078 0.079
## Cumulative Var 0.079 0.157 0.235
##
## $rotmat
## [,1] [,2] [,3]
## [1,] 0.9380762 0.1661701 0.1789950
## [2,] -0.1411449 0.9414354 -0.2598980
## [3,] -0.3507245 0.3153516 0.9609461
## Oblique rotation method Oblimin Quartimin converged.
## Loadings:
## Comp.1 Comp.2 Comp.3
## alcohol 0.0231 0.5398 0.0666
## malic.acid -0.2459 0.1492 -0.1997
## ash 0.0975 0.1298 -0.6809
## ash.alkalinity -0.0878 -0.2141 -0.6116
## magnesium 0.1227 0.2692 -0.1960
## tot..phenols 0.4063 0.0874 -0.0833
## flavonoids 0.4438 0.0249 -0.0607
## non.flav..phenols -0.2536 -0.0675 -0.2208
## proanth 0.3312 0.0456 -0.0951
## col..int. -0.1890 0.5254 -0.0574
## col..hue 0.3434 -0.2381 0.0653
## OD.ratio 0.4237 -0.1460 -0.0329
## proline 0.1979 0.4292 0.0553
##
## Rotating matrix:
## [,1] [,2] [,3]
## [1,] 0.966 0.151 0.186
## [2,] -0.129 0.952 -0.316
## [3,] -0.228 0.272 0.931
##
## Phi:
## [,1] [,2] [,3]
## [1,] 1.00000 0.0387 -0.00713
## [2,] 0.03868 1.0000 0.01911
## [3,] -0.00713 0.0191 1.00000
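Rotations like those shown above can be reproduced with stats::varimax in the orthogonal case (oblique rotations such as oblimin are implemented in the GPArotation package). A sketch on stand-in data:

```r
# Varimax rotation of the first three PCA loadings
pcaDemo <- princomp(USArrests, cor = TRUE)
rot <- varimax(loadings(pcaDemo)[, 1:3])
rot$loadings  # rotated loadings
rot$rotmat    # 3 x 3 orthogonal rotating matrix
```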
The most common clustering strategies are
dfWinesScaled <- data.frame(scale(dfWines))
clWines_hca <- hclust(stats::dist(dfWinesScaled,
method="euclidean"),
method="ward.D2")
str(clWines_hca)

## List of 7
## $ merge : int [1:176, 1:2] -9 -131 -11 -15 -92 -34 -16 -164 -20 -22 ...
## $ height : num [1:176] 1.16 1.19 1.21 1.22 1.24 ...
## $ order : int [1:177] 158 159 153 175 176 167 171 164 172 156 ...
## $ labels : NULL
## $ method : chr "ward.D2"
## $ call : language hclust(d = stats::dist(dfWinesScaled, method = "euclidean"), method = "ward.D2")
## $ dist.method: chr "euclidean"
## - attr(*, "class")= chr "hclust"
## winesGroups_hca
## 1 2 3
## 63 58 56
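The group counts above presumably come from cutting the dendrogram into three clusters with cutree. A self-contained sketch, with scaled USArrests standing in for dfWinesScaled:

```r
# Ward clustering on scaled data, then cut the tree into 3 groups
demoScaled <- scale(USArrests)
hc <- hclust(dist(demoScaled, method = "euclidean"), method = "ward.D2")
groups <- cutree(hc, k = 3)
table(groups)  # cluster sizes
```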
dfWinesScaled <- data.frame(scale(dfWines))
clWines_km <- kmeans(dfWinesScaled, centers=3)
str(clWines_km)

## List of 9
## $ cluster : int [1:177] 3 3 3 3 3 3 3 3 3 3 ...
## $ centers : num [1:3, 1:13] 0.174 -0.918 0.833 0.864 -0.395 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "1" "2" "3"
## .. ..$ : chr [1:13] "alcohol" "malic.acid" "ash" "ash.alkalinity" ...
## $ totss : num 2288
## $ withinss : num [1:3] 326 559 382
## $ tot.withinss: num 1268
## $ betweenss : num 1020
## $ size : int [1:3] 51 65 61
## $ iter : int 2
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
## winesGroups_km
## 1 2 3
## 51 65 61
## [1] 0.8518456
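The agreement value above may come from comparing the two partitions. One simple measure is the Rand index, the proportion of pairs of observations on which two clusterings agree (a sketch, not necessarily the exact statistic used here):

```r
# Rand index: fraction of observation pairs treated alike by both labelings
randIndex <- function(a, b) {
  n <- length(a)
  sameA <- outer(a, a, "==")        # pairs in the same cluster under a
  sameB <- outer(b, b, "==")        # pairs in the same cluster under b
  agree <- sameA == sameB           # pairs on which both labelings agree
  (sum(agree) - n) / (n * (n - 1))  # exclude the diagonal
}
randIndex(c(1, 1, 2, 2), c(2, 2, 1, 1))  # invariant to label permutation
```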
dfWinesClust <- cbind(dfWines,
hca=factor(winesGroups_hca),
km=factor(winesGroups_km))
g1 <- ggplot(dfWinesClust,
aes(x=flavonoids, y=alcohol, color=hca)) +
geom_point(shape=3) + geom_encircle() + theme_classic() +
theme(legend.position = "top")
g2 <- ggplot(dfWinesClust,
aes(x=flavonoids, y=alcohol, color=km)) +
geom_point(shape=3) + geom_encircle() + theme_classic() +
theme(legend.position = "top")
grid.arrange(g1,g2,ncol=2)